Juan Luis Jurado Esteve
Julia Gómez Concejo
import pandas as pd
import numpy as np
1. (0.4 points) Explore your data and do a simplified EDA, mainly in order to determine how many features and how many instances there are, which variables are categorical / numerical, which features have missing values and how many, whether there are constant columns (that should be removed), and whether it is a regression or classification problem (energy is the response variable). You might want to explore other issues you find interesting, but bear in mind that in this assignment EDA is only 0.4 points.
wind_ava = pd.read_csv('wind_available.csv.gzip', compression='gzip')
nrows = wind_ava.shape[0]
ncol = wind_ava.shape[1]
print('There are',nrows,'instances and',ncol,'features.')
column_names_df = pd.DataFrame(wind_ava.columns)
print(column_names_df.to_string())
There are 4748 instances and 555 features.
0
0 energy
1 year
2 month
3 day
4 hour
5 p54.162.1
... (output abridged: each of the 22 meteorological variables, p54.162, p55.162, cape, p59.162, lai_lv, lai_hv, u10n, v10n, sp, stl1, u10, v10, t2m, stl2, stl3, iews, inss, stl4, fsr, flsr, u100 and v100, appears once per grid location, suffixed .1 to .25)
554 v100.25
From inspection of the data frame header and the attribute-names appendix in the instructions, we conclude that the instances contain the following attributes: energy, year, month, day, hour and, at each of the 25 locations, all of the following values in this order: p54.162 (Vertical integral of temperature), p55.162 (Vertical integral of water vapour), cape (Convective available potential energy), p59.162 (Vertical integral of divergence of kinetic energy), lai_lv (Leaf area index, low vegetation), lai_hv (Leaf area index, high vegetation), u10n (Neutral wind at 10 m u-component), v10n (Neutral wind at 10 m v-component), sp (Surface pressure), stl1 (Soil temperature level 1), u10 (10 metre U wind component), v10 (10 metre V wind component), t2m (2 metre temperature), stl2 (Soil temperature level 2), stl3 (Soil temperature level 3), iews (Instantaneous eastward turbulent surface stress), inss (Instantaneous northward turbulent surface stress), stl4 (Soil temperature level 4), fsr (Forecast surface roughness), flsr (Forecast logarithm of surface roughness for heat), u100 (100 metre U wind component) and v100 (100 metre V wind component).
print(wind_ava.head())
print(wind_ava["year"].unique())
energy year month day hour p54.162.1 p54.162.2 p54.162.3 \
0 402.71 2005 1 2 18 2.534970e+06 2.526864e+06 2.518754e+06
1 696.80 2005 1 3 0 NaN NaN 2.521184e+06
2 1591.15 2005 1 3 6 2.533727e+06 2.525703e+06 2.517678e+06
3 1338.62 2005 1 3 12 NaN 2.526548e+06 2.518609e+06
4 562.50 2005 1 3 18 2.529543e+06 NaN 2.513702e+06
p54.162.4 p54.162.5 ... v100.16 v100.17 v100.18 v100.19 \
0 2.510648e+06 2.502537e+06 ... -4.683596 NaN -4.407196 NaN
1 2.513088e+06 NaN ... -3.397886 -3.257192 -3.115998 -2.975304
2 2.509654e+06 NaN ... -1.454105 NaN -1.138290 NaN
3 2.510670e+06 2.502732e+06 ... 1.255015 1.370265 1.485515 1.600765
4 2.505782e+06 2.497861e+06 ... 1.939031 NaN NaN 2.193977
v100.20 v100.21 v100.22 v100.23 v100.24 v100.25
0 -4.131295 -4.669626 -4.528932 -4.388736 -4.248540 -4.107846
1 -2.834609 -3.396390 -3.254198 -3.112506 -2.970314 NaN
2 -0.822476 -1.459094 -1.302933 -1.147271 -0.991110 -0.834949
3 1.716015 1.210612 1.319376 1.428140 1.536405 1.645169
4 2.278793 1.873673 1.953000 2.031829 2.111157 2.189986
[5 rows x 555 columns]
[2005 2006 2007 2008 2009]
Note that, since the instances are ordered chronologically, we cannot shuffle data when carrying out the partition for evaluation. For this reason, we will use time intervals of 1 year to make partitions, as we will explain later.
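The chronological constraint can be sketched as follows; the tiny frame below is synthetic and stands in for wind_ava (a hedged illustration of the idea, not the code we use later):

```python
import pandas as pd

# Synthetic stand-in for wind_ava: one row per year plus an extra 2009 row
df = pd.DataFrame({'year': [2005, 2006, 2007, 2008, 2009, 2009],
                   'energy': [402.7, 696.8, 1591.2, 1338.6, 562.5, 700.0]})

# Chronological holdout: no shuffling; the last year is kept apart for testing
train = df[df['year'] <= 2008]
test = df[df['year'] == 2009]

print(len(train), len(test))  # 4 2
```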
if all(pd.api.types.is_numeric_dtype(dtype) for dtype in wind_ava.dtypes):
    print('All of the columns contain numerical data.')
else:
    print('Not all columns are numerical.')
All of the columns contain numerical data.
From an analysis of the column dtypes we conclude that all the data are numerical.
missing_values_count = wind_ava.isnull().sum()
# Filtering the columns with at least one na
missing_values_columns = missing_values_count[missing_values_count > 0]
# Printing the features with na and the number of na
print(missing_values_columns)
p54.162.1 372
p54.162.2 461
p54.162.3 263
p54.162.4 314
p54.162.5 489
...
v100.21 261
v100.22 387
v100.23 569
v100.24 579
v100.25 436
Length: 550, dtype: int64
Since 550 columns contain missing values, starting with p54.162.1, and the data frame contains 555 columns, we conclude that every attribute except energy, year, month, day and hour has missing values. Although this is a large number of attributes with missing values, it is not concerning, since gaps of this kind are plausible in real-life measurements. In fact, it is reassuring that the dates at which the measurements were taken are fully recorded in the dataset, as is the energy produced, which is the key attribute of our study.
mean_na=np.mean(missing_values_count[missing_values_count > 0])
median_na=np.median(missing_values_count[missing_values_count > 0])
max_na=max(missing_values_count)
columns_with_max_na = missing_values_count[missing_values_count == max_na]
max_na_dict = columns_with_max_na.to_dict()
print('Median:',np.round(median_na,1),'na. Mean:',np.round(mean_na,1),
'na. Column with the maximum number of na:',max_na_dict,'na.')
median_na_prop=median_na/nrows*100
mean_na_prop=mean_na/nrows*100
max_na_dict[list(max_na_dict.keys())[0]] = np.round(max_na/nrows*100,1)
print('Median proportion of na:',np.round(median_na_prop,1),'%. Mean proportion of na:',
np.round(mean_na_prop,1),'%.')
print('Column with the maximum number of na:',max_na_dict,'%.')
Median: 587.5 na. Mean: 593.0 na. Column with the maximum number of na: {'lai_hv.13': 964} na.
Median proportion of na: 12.4 %. Mean proportion of na: 12.5 %.
Column with the maximum number of na: {'lai_hv.13': 20.3} %.
From this analysis we conclude that the median and mean number of NA (among the columns that contain NA) are very similar, with values of 587.5 and 593.0. The column 'lai_hv.13' (leaf area index, high vegetation, at location 13) contains the maximum number of NA, 964, which is very high: more than 1.5 times the mean. To judge whether this number of NA is significant, we compared it with the number of rows, obtaining a mean percentage of missing values (per column containing them) of 12.5% and a median of 12.4%. These proportions are not high enough to discard any column; even the highest, the 20.3% of 'lai_hv.13', is still plausible for an experimental measurement.
Some brief research on the leaf area index reveals that, traditionally, researchers measured it by harvesting all the leaves from a plot and painstakingly measuring the area of each leaf. Modern equipment such as flatbed scanners has made this process more efficient, but it remains labour-intensive, time-consuming and destructive. This helps explain the large amount of missing data for this parameter at location 13. In particular, we have studied the 'lai_hv' columns to see how many of them fall above the mean number of NA:
lai_hv_counts = missing_values_count.filter(regex='lai_hv')
columns_lai_hv_na = lai_hv_counts[lai_hv_counts > mean_na]
print('Proportion of locations with a number of missing values above the mean regarding the parameter lai_hv:', len(columns_lai_hv_na)/25)
Proportion of locations with a number of missing values above the mean regarding the parameter lai_hv: 0.6
As expected, the number of missing values in the 'lai_hv' columns falls above the overall mean in 60% of the locations.
locations_na = []
for i in range(1, 26):
    regex_pattern = f'\\.{i}$'
    locations_na.append(np.mean(missing_values_count.filter(regex=regex_pattern)))
print('The location with the maximum number of na is location',np.argmax(locations_na) + 1,
', with',np.round(max(locations_na),1),'na.')
The location with the maximum number of na is location 3 , with 703.8 na.
According to our computations, location 3 has on average the most missing values of all locations taking into account all of the parameters.
nan_count_per_row = wind_ava.isna().sum(axis=1)
if all(nan_count_per_row > 0):
    print('All of the instances have NA.')
All of the instances have NA.
This result is quite revealing, since it forces us to use missing value imputation: dropping every incomplete row would leave no data at all.
na_count_per_row = wind_ava.isna().sum(axis=1)
max_na_row_index = na_count_per_row.idxmax()
print(f"The instance with most NA is the row of index {max_na_row_index}, having {max(na_count_per_row)} NA.")
The instance with most NA is the row of index 3946, having 97 NA.
We have also determined that the instance with most NA has 97 NA, but that is still not enough to discard that instance, since there are 550 features in total.
Now we will analyse the object of study, the energy feature, by computing its mean and standard deviation:
energy_sd=np.std(wind_ava['energy'])
energy_mean=np.mean(wind_ava['energy'])
print('The energy mean is',np.round(energy_mean,1),'and its standard deviation is',np.round(energy_sd,1),'.')
The energy mean is 693.1 and its standard deviation is 665.5 .
constant_columns = [col for col in wind_ava.columns if wind_ava[col].nunique() == 1]
if constant_columns:
    print("Constant columns:", constant_columns)
else:
    print("There are no constant columns in our DataFrame.")
There are no constant columns in our DataFrame.
In addition, we have determined that there are no constant columns in our data frame. Finally, we have also determined theoretically that this is a case of regression (and not of classification), since the variable to predict is numerical rather than categorical, and the aim is to find a function mapping the inputs to this continuous output.
Finally, we will split the data into two dataframes: X (containing the attributes used for the models) and y (containing only the output variable). This will ease up the code for the rest of the assignment.
y = wind_ava.iloc[:, 0] #output variable (labels)
X = wind_ava.iloc[:, 5:] #attributes
2. (0.3 points) Decide how you are going to carry out the outer evaluation (estimation of future performance) and the inner evaluation for comparing different alternatives. Decide which metric you are going to use. Provide justifications.
We have decided to carry out nested cross-validation in order to reduce the bias in model evaluation and optimization that comes with holdout validation. Therefore, we will use 5-fold cross-validation for the estimation of future performance and 4-fold cross-validation for hyperparameter tuning.
The reason for choosing these particular partitions is the following. On the one hand, the dataset is fairly large, so $\frac{1}{5}$ of it is more than enough to test performance, which leaves $\frac{4}{5}$ of the data to train the model. On the other hand, the CASH problem is computationally very expensive, especially when performing nested cross-validation. By keeping the number of folds moderate we reduce the bias while keeping training computationally tractable.
In addition, we have chosen the partitions so that each fold contains a whole year. This is due to the seasonal nature of the data: if we took partitions of 6 months, for example, the meteorological parameters in different partitions would not be comparable, as one may cover a hotter summer and another a windier winter. By choosing whole years, we ensure that the partitions do not carry this type of bias.
Regarding the evaluation and comparison of the models, we have decided to use the Root Mean Squared Error. Since this data is used to estimate the energy produced in a given region, we think it is crucial that the error is interpretable, in order to ease the selection of margins of error for the client. The client might want to hire some other company to provide the rest of the energy that renewable energy cannot. Thus, it would be very interesting to know the average error in the same units as the output variable.
In addition, outliers have a greater impact on this metric, and these are precisely the points at risk of falling outside the mentioned error margin. Being off by 10 units should count as much worse than being off by 5, not merely twice as bad. For this reason we will use RMSE over MAE.
def RMSE(pred, test):
    return np.sqrt(np.mean((pred - test)**2))
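As a quick sanity check of this choice, two error patterns with the same mean absolute error give different RMSE values; the single large miss is penalized more heavily (the helper is redefined here so the snippet runs on its own):

```python
import numpy as np

def RMSE(pred, test):
    return np.sqrt(np.mean((pred - test)**2))

y_true = np.zeros(2)
even_errors = np.array([5.0, 5.0])    # off by 5 twice (MAE = 5)
spiky_errors = np.array([0.0, 10.0])  # one miss of 10  (MAE = 5 as well)

print(RMSE(even_errors, y_true))   # 5.0
print(RMSE(spiky_errors, y_true))  # sqrt(50), about 7.07: the outlier weighs more
```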
3. (4.7 points) Main body of the assignment. The Sotavento company wants you to get some conclusions about the following issues. In order to come up with a conclusion, you will need to experiment with different alternatives and compare them with the inner evaluation.
a. (0.3 points) The following issues need not be dealt with in that order, nor be systematic. Please, provide an initial plan about the order in which you are going to explore these issues (although you can change your plan later).
We will now provide a brief outline of the following sections. Further explanation about each one can be found in the corresponding section.
First of all, we will study whether all the variables are necessary for training our model. There are two key issues to tackle here: using the data only at Sotavento or at all 25 locations provided, and deciding whether each of the 22 attributes provided per location is useful.
Next, we will proceed to the imputation of missing values. We will test several methods to determine which yield better results (the details will be showcased in the corresponding section).
In the third place, we will proceed to the analysis of the CASH problem. We will train 4 different machine learning algorithms and compare their predictions to select the best possible model among them.
Then, we will explore the effect of introducing Stacking into the analysis. Using the 4 previously trained models, we will build a meta model and compare it with the best model obtained in the previous section, to verify whether stacking improves the predictions.
Finally, we will write a small report on how ChatGPT was used during the assignment and what advantages and disadvantages it brought.
c. (0.3 points) Are all the 550 attributes really necessary? Maybe only the attributes related to the Sotavento location (13th location in the grid) are actually required? Or only the attributes related to wind? (you are not expected to use automatic feature selection techniques here, rather, select features by hand)
Regarding the attributes measured at each of the 25 locations, we have decided to keep all of them. Since neither of us has expertise in renewable energy production, we do not feel comfortable selecting features by hand without prior testing.
About whether to keep data from all 25 locations, we will train 2 regression trees, one using only the 22 variables for the Sotavento location and another using all the variables. Then, we will compare the results and decide if the data from the different locations help with prediction.
Since the goal of this section is a simple comparison, we have chosen decision trees because they are faster to train compared with other models. Given that scikit-learn cannot handle missing values in decision trees, we will impute the missing values by the mean of each column.
In this same line, we have decided to use holdout validation for the outer evaluation and 4-fold crossvalidation for the inner evaluation, using random search for HPT. This split means that we will take all data in years 2005-2008 for training and year 2009 for testing.
from sklearn.model_selection import train_test_split, RandomizedSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.impute import SimpleImputer
## Outer evaluation split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =1/5,
shuffle = False)
## Inner evaluation split
inner = KFold(n_splits = 4, shuffle = False)
## Impute missing values
imputer = SimpleImputer(strategy = 'mean')
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)  # impute the test set with training-set statistics to avoid leakage
## Set the grid of hyperparameters
param_tree = {
    "max_depth": range(5, 26, 2),
    "min_samples_split": [2, 10, 14, 20, 30] }
## Initialize the model
tree_reg = DecisionTreeRegressor(random_state = 100512068)
## Model with all 25 locations ##
rand_search_tree = RandomizedSearchCV(tree_reg, n_iter = 10,
cv = inner, param_distributions = param_tree,
scoring = "neg_root_mean_squared_error",
random_state = 100512068)
rand_search_tree.fit(X_train_imp, y_train)
test_predictions_25 = rand_search_tree.predict(X_test_imp)
test_RMSE_25 = RMSE(test_predictions_25, y_test)
## Model for Sotavento ##
# Columns are grouped by variable (25 locations per variable), so the columns
# for Sotavento (location 13, suffix '.13') are not contiguous: select them by name
sot_cols = [j for j, col in enumerate(X.columns) if col.endswith('.13')]
X_train_sot = X_train_imp[:, sot_cols]
X_test_sot = X_test_imp[:, sot_cols]
rand_search_tree = RandomizedSearchCV(tree_reg, n_iter = 10,
cv = inner, param_distributions = param_tree,
scoring = "neg_root_mean_squared_error",
random_state = 100512068)
rand_search_tree.fit(X_train_sot, y_train)
test_predictions_sot = rand_search_tree.predict(X_test_sot)
test_RMSE_sot = RMSE(test_predictions_sot, y_test)
ratio = test_RMSE_25*100/test_RMSE_sot
print(f"RMSE using all available data: {round(test_RMSE_25, 2)}")
print(f"RMSE using only data for Sotavento: {round(test_RMSE_sot, 2)}")
print('The RMSE of all available data is',round(ratio,2),'% that of Sotavento.')
RMSE using all available data: 454.15
RMSE using only data for Sotavento: 520.39
The RMSE of all available data is 87.27 % that of Sotavento.
As we can infer from the comparison, data from the surrounding locations do help improve the predictions: the RMSE using all locations is approximately 87% of that obtained using only the data from Sotavento. For this reason, we will carry out the rest of the assignment using all 550 variables.
d. (0.4 points) Does imputation improve performance in this problem? Which method seems to work best?
As we saw in the first section, this dataset contains missing values in almost all of its columns. Given that scikit-learn cannot handle missing values and that every row has at least 1 missing value, we need to select a method of imputation.
We will compare 3 different imputation methods: imputing with the mean, iterative imputation and KNN imputation. Analogously to the previous section, we will train 3 trees on data imputed with the different methods and use them to make predictions. Then we will compare their RMSE to select the optimal imputer (note that the RMSE for the mean imputer has already been computed).
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import KNNImputer, IterativeImputer
## Model using iterative imputer ##
imputer = IterativeImputer(random_state = 100512068)
X_train_imp1 = imputer.fit_transform(X_train)
X_test_imp1 = imputer.transform(X_test)  # impute the test set with training-set statistics to avoid leakage
## Set the grid of hyperparameters
param_tree = {
    "max_depth": range(5, 26, 2),
    "min_samples_split": [2, 10, 14, 20, 30] }
## Train the model
tree_reg = DecisionTreeRegressor(random_state = 100512068)
rand_search_tree = RandomizedSearchCV(tree_reg, n_iter = 10,
cv = inner, param_distributions = param_tree,
scoring = "neg_root_mean_squared_error",
random_state = 100512068)
rand_search_tree.fit(X_train_imp1, y_train)
test_predictions_it = rand_search_tree.predict(X_test_imp1)
test_RMSE_it = RMSE(test_predictions_it, y_test)
## Model using KNN imputer ##
imputer = KNNImputer(n_neighbors = 5)
X_train_imp2 = imputer.fit_transform(X_train)
X_test_imp2 = imputer.transform(X_test)  # impute the test set with training-set statistics to avoid leakage
## Set the grid of hyperparameters
param_tree = {
    "max_depth": range(5, 26, 2),
    "min_samples_split": [2, 10, 14, 20, 30] }
## Train the model
tree_reg = DecisionTreeRegressor(random_state = 100512068)
rand_search_tree = RandomizedSearchCV(tree_reg, n_iter = 10,
cv = inner, param_distributions = param_tree,
scoring = "neg_root_mean_squared_error",
random_state = 100512068)
rand_search_tree.fit(X_train_imp2, y_train)
test_predictions_knn = rand_search_tree.predict(X_test_imp2)
test_RMSE_knn = RMSE(test_predictions_knn, y_test)
print(f"RMSE imputing with the mean: {round(test_RMSE_25, 2)}")
print(f"RMSE using iterative imputation: {round(test_RMSE_it, 2)}")
print(f"RMSE using KNN imputation: {round(test_RMSE_knn, 2)}")
RMSE imputing with the mean: 348.27
RMSE using iterative imputation: 326.12
RMSE using KNN imputation: 351.87
The best score corresponds to the model trained imputing values using the iterative imputer. This method trains a model to impute every missing value using the rest of the values in the dataset. However, it is computationally much heavier than the other methods.
To conduct the study of all models in the CASH problem, we will perform an iterative imputation of all 550 attributes. Since we will be carrying out inner and outer evaluation, the imputation should not be tied to one particular train/test partition as before. The ideal approach would be to impute year by year, but this would take a long time and the precision of the imputation would suffer from the small number of data points per year. For this reason, we have decided to perform a single iterative imputation of the entire data matrix.
imputer = IterativeImputer(random_state = 100512068)
X_imp = imputer.fit_transform(X)
np.savetxt('X_imputed.txt', X_imp)
np.savetxt('y.txt', y)
X = np.loadtxt('X_imputed.txt')
y = np.loadtxt('y.txt')
e. (2.6 points) Which method is more appropriate to the problem? (among trees, KNN, and two ensemble methods). Does hyper-parameter tuning contribute to improve performance? At what cost (compute time). Which HPO method does perform better? (among Random Search, Optuna, and Halving Search).
We will now present the 5 methods that will be compared against each other using the RMSE metric, with a brief description of each.
K-Nearest Neighbors (KNN): KNN is a non-parametric, instance-based, supervised learning algorithm used for both classification and regression tasks. It operates on the principle of similarity: a new instance is predicted from its 'k' nearest neighbours in the training set, found using distance metrics such as the Euclidean, Manhattan or Minkowski distances. In classification the prediction is the majority class of the neighbours; in regression, as here, it is the average of their target values. KNN is simple to understand and implement, but can be computationally expensive for large datasets, as it requires storing all training data and computing distances at prediction time.
Gradient Boosting: This method builds an ensemble of decision trees in a sequential manner, where each subsequent tree is built to correct the errors made by the previous ones. Gradient boosting regression involves optimizing a loss function, and it generally provides high accuracy in predictive modeling. This method can be quite powerful but may be more prone to overfitting if not configured correctly. It is often used in machine learning competitions for its effectiveness in handling a variety of data types and distributions.
Decision Trees: Decision Trees are versatile supervised learning algorithms used for both classification and regression tasks. They recursively split the feature space into distinct regions by asking a series of binary questions based on feature attributes. At each node of the tree, the algorithm selects the most informative feature to make the split that best separates the data. This process continues until a stopping criterion is met, such as reaching a maximum depth or minimum samples per leaf. Decision Trees are interpretable models, this is, they provide transparent insights into the decision-making process and are valuable for understanding feature importance within a dataset. However, decision trees have a tendency to overfit data.
Random Forest: This is an ensemble learning method that operates by constructing multiple decision trees during training and outputting the mean prediction of the individual trees. Random forest regression is effective in reducing overfitting and is capable of handling large datasets with higher dimensionality. It works well for both categorical and continuous input and output variables.
Extra trees: Extra Trees Regression is an ensemble machine learning algorithm that belongs to the family of decision tree algorithms. It stands for Extremely Randomized Trees Regression. The method works by constructing a multitude of decision trees at training time and outputting the average prediction of the individual trees. Unlike traditional decision trees, which choose the most discriminative thresholds for splitting features, Extra Trees Regression adds randomness to the model in two ways: it selects random subsets of the features at each split in the decision tree, and it also uses random thresholds for each feature rather than searching for the best possible thresholds. This randomness helps to increase the model's robustness and can prevent overfitting, often resulting in better generalization to new data. Additionally, due to its simplicity and the fact that it requires very little parameter tuning, Extra Trees Regression is an attractive option for many regression tasks.
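The five candidate regressors described above could be instantiated as follows (a sketch with default hyperparameters; the actual grids and training live in the separate scripts):

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor)

# One entry per method compared in this section; the random_state matches
# the seed used elsewhere in the assignment
candidates = {
    'KNN': KNeighborsRegressor(),
    'Decision tree': DecisionTreeRegressor(random_state = 100512068),
    'Random forest': RandomForestRegressor(random_state = 100512068),
    'Extra trees': ExtraTreesRegressor(random_state = 100512068),
    'Gradient boosting': GradientBoostingRegressor(random_state = 100512068),
}
print(list(candidates))
```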
For each of these models, we will get the estimation of future performance by training models without hyperparameter optimization, as well as two methods of HPO: random search and successive halving. Once we have estimations for the best model of each kind, we will compare among them and choose the one with the best estimation of future performance (taking into account the time taken by each method). Since the computations for all of these methods can be heavy, we have decided to perform them in a separate script and then import the results. These scripts generate a dictionary with the trained models, RMSE score and computation time for each of the three mentioned methods of setting hyperparameters.
We have saved all of the information regarding these models in the cashmodels folder.
Let us start with KNN. We have computed the model by running the script $\texttt{knn.py}$ and saved it in the file $\texttt{KNN_models.pkl}$. We chose a grid of parameters consisting of 22 values for $k$ and three values for the metric, making a total of 66 parameter combinations. The values for the number of neighbours include the odd numbers from 5 to 49 (using only odd numbers is a convention inherited from classification, where it avoids ties; for regression it is harmless). The three metrics considered were the Euclidean, Manhattan and Chebyshev distances (particular cases of the p-norm with $p=2$, $p=1$ and $p\to\infty$, respectively).
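As a rough illustration of how this kind of search can be set up in scikit-learn (the variable names, `n_iter` and random seeds here are our own illustrative choices, not the exact contents of $\texttt{knn.py}$):

```python
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Grid in the spirit of the one described above: odd neighbour counts
# and the three p-norm metrics
param_knn = {
    "n_neighbors": list(range(5, 50, 2)),
    "metric": ["euclidean", "manhattan", "chebyshev"],
}

# The inner CV shown here searches hyperparameters; an outer loop of
# train/test partitions (not shown) would estimate future performance
search = RandomizedSearchCV(
    KNeighborsRegressor(),
    param_distributions=param_knn,
    n_iter=20,
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
```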
import joblib
# Load the .pkl file
knn = joblib.load('cashmodels/KNN_models.pkl')
results = [
{
'Model': 'No HPT',
'RMSE Score': np.round(np.mean(knn["KNN_def"]["Scores"]), 2),
'Time (s)': np.round(knn["KNN_def"]["Times"][0], 2)
},
{
'Model': 'HPT using Random Search',
'RMSE Score': np.round(np.mean(knn["KNN_rand"]["Scores"]), 2),
'Time (s)': np.round(knn["KNN_rand"]["Times"][0], 2)
},
{
'Model': 'HPT using Halving Search',
'RMSE Score': np.round(np.mean(knn["KNN_halving"]["Scores"]), 2),
'Time (s)': np.round(knn["KNN_halving"]["Times"][0], 2)
}
]
format_results = pd.DataFrame(results)
print(format_results)
                      Model  RMSE Score  Time (s)
0                    No HPT      431.13      0.36
1   HPT using Random Search      408.13    269.36
2  HPT using Halving Search      410.69    105.34
Among these three models, the one without HPT clearly gives the worst predictions. There is very little difference between Random Search and Halving Search, but it is worth noting that Halving Search is almost 3 times faster. In contrast, the model without HPT is trained almost 1,000 times faster than with Random Search.
To put the obtained results in context, let us see how the RMSE of a model with HPT (using Random Search, for example) changes over the 5 outer folds, that is, the estimated RMSE for each of the 5 train/test partitions.
for element in knn["KNN_rand"]["Scores"]:
print(np.round(element,2))
392.62
421.61
375.03
438.48
412.9
From these scores we can extract two different conclusions. First of all, it is necessary to do outer evaluation as well to obtain a good estimation of future performance, since there can be a big difference in RMSE depending on the particular partition chosen (as evidenced in the code above). Secondly, if we have two methods where the difference in RMSE is very small compared with these fluctuations, we cannot confidently conclude that one will predict better than the other.
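As a quick check on that spread, we can compute the mean and sample standard deviation of the five fold scores printed above:

```python
import numpy as np

# Outer-fold RMSEs reported above
fold_scores = np.array([392.62, 421.61, 375.03, 438.48, 412.90])
print(f"mean = {fold_scores.mean():.2f}, std = {fold_scores.std(ddof=1):.2f}")
# → mean = 408.13, std = 24.81
```

The roughly 2.5-point RMSE gap between Random Search and Halving Search is well within one standard deviation of this fold-to-fold fluctuation.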
We have finally decided to use the Random Search method for this model, since it improves the performance and only takes 2.5 times more time.
Secondly, we will study the performance of gradient boosting with trees. We will use a hyperparameter grid consisting of 20 possible values for the learning rate, 3 for the minimum sample split and 10 for the maximum tree depth. The learning rate ranges from 0.01 to 0.1 in steps of 0.005, the maximum depth ranges from 5 to 25 in steps of 2, and the possible minimum sample splits are 5, 15 and 30.
Note that HistGradientBoostingRegressor does not expose a number-of-estimators parameter; the number of boosting iterations is capped by $\texttt{max_iter}$ and, with early stopping enabled, the effective number is determined automatically. We use early stopping with a maximum of 300 iterations. Since we evaluate with RMSE, we keep the default loss function, that is, squared error (MSE).
We have computed the model by running the script $\texttt{gradientboosting.py}$ and saved it in the file $\texttt{GradientBoosting_models.pkl}$.
# Load the .pkl file
gb = joblib.load('cashmodels/GradientBoosting_models.pkl')
results = [
{
'Model': 'No HPT',
'RMSE Score': np.round(np.mean(gb["gb_def"]["Scores"]), 2),
'Time (s)': np.round(gb["gb_def"]["Time"][0], 2)
},
{
'Model': 'HPT using Random Search',
'RMSE Score': np.round(np.mean(gb["gb_rand"]["Scores"]), 2),
'Time (s)': np.round(gb["gb_rand"]["Time"][0], 2)
},
{
'Model': 'HPT using Halving Search',
'RMSE Score': np.round(np.mean(gb["gb_halving"]["Scores"]), 2),
'Time (s)': np.round(gb["gb_halving"]["Time"][0], 2)
}
]
format_results = pd.DataFrame(results)
print(format_results)
                      Model  RMSE Score  Time (s)
0                    No HPT      372.55     11.48
1   HPT using Random Search      368.35    958.96
2  HPT using Halving Search      368.76   2394.41
Both hyperparameter tuning methods yield slightly better results than no HPT, but the improvement is too small to be considered meaningful. Given that both HPT methods take more than 80 times as long as the default configuration, we have decided that the default hyperparameters are the most efficient option for this model.
For Decision Trees we have chosen a grid of parameters consisting of 20 values for the maximum depth and 39 values for the minimum split, making a total of 780 possible parameter combinations. The values for the maximum depth include all integers from 10 to 29, and the values for the minimum split include all integers from 2 to 39. We have computed the model by running the script $\texttt{dectrees.py}$ and saved it in the file $\texttt{dectrees_models.pkl}$.
# Load the .pkl file
trees = joblib.load('cashmodels/dectrees_models.pkl')
results = [
{
'Model': 'No HPT',
'RMSE Score': np.round(np.mean(trees["trees_def"]["Scores"]), 2),
'Time (s)': np.round(trees["trees_def"]["Times"][0], 2)
},
{
'Model': 'HPT using Random Search',
'RMSE Score': np.round(np.mean(trees["trees_rand"]["Scores"]), 2),
'Time (s)': np.round(trees["trees_rand"]["Times"][0], 2)
},
{
'Model': 'HPT using Halving Search',
'RMSE Score': np.round(np.mean(trees["trees_halving"]["Scores"]), 2),
'Time (s)': np.round(trees["trees_halving"]["Times"][0], 2)
}
]
format_results = pd.DataFrame(results)
print(format_results)
                      Model  RMSE Score  Time (s)
0                    No HPT      498.18     10.53
1   HPT using Random Search      434.50    504.38
2  HPT using Halving Search      447.51    124.04
The models with HPT clearly improve on the default one, with Random Search outperforming Halving Search. Computationally, the HPT methods take more than 10 times longer than the default. Even though Random Search improves the RMSE by 13 points over Halving Search, it is 4 times slower; since the improvement is significant, we have decided that Random Search is the best HPT method for this model.
For Random Forest we have not performed hyperparameter tuning: a first estimate of the cost of the default evaluation, together with some initial iterations of the Random Search method over a hyperparameter grid, led us to conclude that hyperparameter tuning was too time-consuming for this method. We have computed the model by running the script $\texttt{randomforest.py}$ and saved it in the file $\texttt{rf_models.pkl}$.
The range in which we considered evaluating the Random Forest model was the following:
param_rf = {
    "max_depth": range(10, 100, 3),
    "n_estimators": range(100, 1001, 50),
    "min_samples_split": range(2, 11, 1),
}
However, after computing the default model (which took 705.85 s) and starting a first iteration (of the outer evaluation) that had not finished even after 4 hours, it was clear that we could not afford to perform inner and outer cross-validation together with hyperparameter tuning for the Random Forest model.
# Load the .pkl file
rf = joblib.load('cashmodels/rf_models.pkl')
results = [
{
'Model': 'No HPT',
'RMSE Score': np.round(np.mean(rf["rf_def"]["Scores"]), 2),
'Time (s)': np.round(rf["rf_def"]["Times"][0], 2)
}
]
format_results = pd.DataFrame(results)
print(format_results)
    Model  RMSE Score  Time (s)
0  No HPT      369.69    705.85
For Extra Trees we have chosen the same grid of hyperparameters as for Decision Trees. We know it is advisable to change it and also to include another important hyperparameter, the number of estimators. However, the default Extra Trees model already performed very well by itself while also taking a lot of time, so we did not want to scale up the time unnecessarily. This is why we have chosen a grid of parameters consisting of 20 values for the maximum depth and 39 values for the minimum split, making a total of 780 possible parameter combinations. The values for the maximum depth include all integers from 10 to 29, and the values for the minimum split include all integers from 2 to 39. We have computed the model by running the script $\texttt{extratrees.py}$ and saved it in the file $\texttt{extra_models.pkl}$.
# Load the .pkl file
et = joblib.load('cashmodels/extratrees1_models.pkl')
results = [
{
'Model': 'No HPT',
'RMSE Score': np.round(np.mean(et["extra_def"]["Scores"]), 2),
'Time (s)': np.round(et["extra_def"]["Times"][0], 2)
},
{
'Model': 'HPT using Random Search',
'RMSE Score': np.round(np.mean(et["extra_rand"]["Scores"]), 2),
'Time (s)': np.round(et["extra_rand"]["Times"][0], 2)
},
{
'Model': 'HPT using Halving Search',
'RMSE Score': np.round(np.mean(et["extra_halving"]["Scores"]), 2),
'Time (s)': np.round(et["extra_halving"]["Times"][0], 2)
}
]
format_results = pd.DataFrame(results)
print(format_results)
                      Model  RMSE Score  Time (s)
0                    No HPT      376.79    135.32
1   HPT using Random Search      371.75   4037.07
2  HPT using Halving Search      374.26   1937.51
The hyperparameter tuning methods improve the RMSE slightly, but the cost in time (10 times higher or more) is not worth the small improvement.
After having considered the best hyperparameter tuning option for each model, we will now compare the models by the RMSE metric in order to select the best option and compute the final model.
results = [
{
'Model': 'KNN using Random Search',
'RMSE Score': np.round(np.mean(knn["KNN_rand"]["Scores"]), 2),
'Time (s)': np.round(knn["KNN_rand"]["Times"][0], 2)
},
{
'Model': 'Gradient Boosting, No HPT',
'RMSE Score': np.round(np.mean(gb["gb_def"]["Scores"]), 2),
'Time (s)': np.round(gb["gb_def"]["Time"][0], 2)
},
{
'Model': 'Decision Trees using Random Search',
'RMSE Score': np.round(np.mean(trees["trees_rand"]["Scores"]), 2),
'Time (s)': np.round(trees["trees_rand"]["Times"][0], 2)
},
{
'Model': 'Random Forest, No HPT',
'RMSE Score': np.round(np.mean(rf["rf_def"]["Scores"]), 2),
'Time (s)': np.round(rf["rf_def"]["Times"][0], 2)
},
{
'Model': 'Extra Trees, No HPT',
'RMSE Score': np.round(np.mean(et["extra_def"]["Scores"]), 2),
'Time (s)': np.round(et["extra_def"]["Times"][0], 2)
}
]
format_results = pd.DataFrame(results)
print(format_results)
                                Model  RMSE Score  Time (s)
0             KNN using Random Search      408.13    269.36
1           Gradient Boosting, No HPT      372.55     11.48
2  Decision Trees using Random Search      434.50    504.38
3               Random Forest, No HPT      369.69    705.85
4                 Extra Trees, No HPT      376.79    135.32
In terms of performance relative to time cost, the best model is the default version of Gradient Boosting: it is second in performance and the fastest to train. The best model purely in terms of performance is the default version of Random Forest.
f. (0.5 points) Try something on your own: a new library, explore an issue you are interested in, etc.
For this section of the assignment we have decided to test whether stacking improves the predictions of the previously trained models. Given the nature of our problem, we will use a simple model, Linear Regression, as the meta-model that combines the predictions of the other, more complex models. After estimating the future performance of the stacked model, we will compare it against the best of the models obtained in the previous section. We have computed the model by running the script $\texttt{stacking.py}$ and saved it in the file $\texttt{Stacking_models.pkl}$. For the stacking, we used the KNN and Decision Trees models tuned with Random Search (their best-performing configuration that does not take too much time) and the Random Forest, Gradient Boosting and Extra Trees models with default hyperparameters (their best-performing configuration).
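A minimal sketch of this kind of stack in scikit-learn (the base-learner choices and hyperparameters here are illustrative, not the exact configuration of $\texttt{stacking.py}$):

```python
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Illustrative stack: base learners produce out-of-fold predictions
# that a simple linear meta-model learns to combine
stack = StackingRegressor(
    estimators=[
        ("knn", KNeighborsRegressor(n_neighbors=9)),
        ("tree", DecisionTreeRegressor(max_depth=15, random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ],
    final_estimator=LinearRegression(),
    cv=5,  # internal CV used to generate the meta-features
)
```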
# Load the .pkl file
stck = joblib.load('cashmodels/Stacking_models.pkl')
results = [
{
'Model': 'Stacking with Linear Regression',
'RMSE Score': np.round(np.mean(stck["Scores"]), 2),
'Time (s)': np.round(stck["Time"][0], 2)
}
]
format_results = pd.DataFrame(results)
print(format_results)
                             Model  RMSE Score  Time (s)
0  Stacking with Linear Regression      366.79   4197.72
We can see that this model performs slightly better than any of the previous ones. However, it is also the most expensive to train, as it requires fitting the 5 base models as well as the meta-model, which together take longer than any single previous model. In this CASH study we performed outer cross-validation, which is very time-consuming; this leads us to believe that for the final model, where we will not perform outer cross-validation, the time cost will be much lower and training the stacking model is feasible. We have therefore chosen it, since it appears to have the best performance.
b. (0.6 points) Tell how you have used ChatGPT in this assignment. You can describe a summary of your experience with ChatGPT in the context of the assignment, important prompts you have used, some cases where you found out that ChatGPT was wrong, etc. No more than 3 pages in the report
First of all, we asked ChatGPT how to directly display images in a Jupyter notebook. It should be noted that we talked to ChatGPT in Spanish, since it is our mother tongue.
We have saved the screenshots of these conversations in the chatgptimages folder.
from IPython.display import Image
Image(filename="chatgptimages/imagesjupyter.png", width=600, height=400)
We used ChatGPT to edit our code so that it sets timers and prints the current iteration of the outer cross-validation, in order to monitor the execution of our scripts:
Image(filename="chatgptimages/codeedition1.png", width=600, height=400)
We asked ChatGPT whether the range over which we planned to evaluate the Random Forest model was appropriate for our data:
Image(filename="chatgptimages/hprf1.png", width=600, height=400)
Image(filename="chatgptimages/hprf2.png", width=600, height=400)
However, in the end we only used this information once, to run one iteration of the Random Search method on the Random Forest model and conclude that this level of hyperparameter tuning was not feasible for that model, due to the large amount of time it took (one iteration had not finished after 4 hours).
We asked ChatGPT which hyperparameters of the Extra Trees model could be tuned:
Image(filename="chatgptimages/hpextratrees1.png", width=600, height=400)
Image(filename="chatgptimages/hpextratrees2.png", width=600, height=400)
However, in the end we opted to reuse the grid of hyperparameters from the Decision Trees model, since we already had a time estimate for that grid and could extrapolate the time required once we had the default Extra Trees model.
We also used ChatGPT to load the .pkl files, since we did not know how to load them into Python. However, ChatGPT failed at this at first, and was not very good at correcting its mistakes:
Image(filename="chatgptimages/pklproblem1.png", width=600, height=400)
Image(filename="chatgptimages/pklproblem2.png", width=600, height=400)
Image(filename="chatgptimages/pklproblem3.png", width=600, height=400)
Image(filename="chatgptimages/pklproblem4.png", width=600, height=400)
Image(filename="chatgptimages/pklproblem5.png", width=600, height=400)
Image(filename="chatgptimages/pklproblem6.png", width=600, height=400)
4. (0.6 points) Once you have decided on the best alternative (based on comparing different alternatives using the inner evaluation):
a. Using the best alternative (based on inner evaluation), make an estimation of the accuracy that the final model might get on future data (outer evaluation).
We already performed this analysis in section 3.f).
b. Train the final model and use it to make predictions on the “competition data”. Save both the final model and the competition predictions on files.
First, we will load the $\texttt{wind_competition}$ dataset, study the data it provides, and check whether it contains any NAs and needs iterative imputation.
test = pd.read_csv('wind_competition.csv.gzip', compression='gzip')
nrows = test.shape[0]
ncol = test.shape[1]
print('There are',nrows,'instances and',ncol,'features.')
column_names_df = pd.DataFrame(test.columns)
print(column_names_df.to_string())
There are 1189 instances and 554 features.
                0
0            year
1           month
2             day
3            hour
4       p54.162.1
...           ...
551       v100.23
552       v100.24
553       v100.25
The 554 columns are year, month, day and hour, plus 25 grid-point copies (suffixes .1 to .25) of each of 22 meteorological variables: p54.162, p55.162, cape, p59.162, lai_lv, lai_hv, u10n, v10n, sp, stl1, u10, v10, t2m, stl2, stl3, iews, inss, stl4, fsr, flsr, u100 and v100.
test.head()
| year | month | day | hour | p54.162.1 | p54.162.2 | p54.162.3 | p54.162.4 | p54.162.5 | p54.162.6 | ... | v100.16 | v100.17 | v100.18 | v100.19 | v100.20 | v100.21 | v100.22 | v100.23 | v100.24 | v100.25 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | 1 | 1 | 0 | 2.403131e+06 | 2.395445e+06 | 2.387755e+06 | 2.380065e+06 | 2.372380e+06 | 2.399548e+06 | ... | 7.212586 | 7.057422 | 6.901760 | NaN | 6.591434 | 7.184147 | 7.030980 | 6.877313 | 6.723647 | 6.570479 |
| 1 | 2010 | 1 | 1 | 6 | 2.410306e+06 | 2.402394e+06 | NaN | 2.386571e+06 | 2.378660e+06 | NaN | ... | 0.207289 | 0.583972 | NaN | NaN | 1.714518 | 0.345988 | 0.723170 | 1.100850 | 1.478031 | 1.855712 |
| 2 | 2010 | 1 | 1 | 12 | 2.434908e+06 | 2.426793e+06 | 2.418683e+06 | 2.410573e+06 | 2.402462e+06 | 2.431465e+06 | ... | 1.670114 | 1.691568 | 1.712522 | 1.733976 | NaN | 1.664127 | 1.682587 | 1.700548 | NaN | 1.736470 |
| 3 | 2010 | 1 | 1 | 18 | 2.447112e+06 | NaN | 2.431027e+06 | 2.422984e+06 | 2.414942e+06 | 2.443696e+06 | ... | 1.217597 | 1.278464 | 1.339332 | 1.399701 | 1.460569 | 1.215102 | 1.272477 | 1.329853 | NaN | 1.444105 |
| 4 | 2010 | 1 | 2 | 0 | 2.459695e+06 | NaN | 2.443809e+06 | 2.435866e+06 | NaN | 2.456252e+06 | ... | 3.755089 | 3.686738 | NaN | 3.549536 | 3.481184 | 3.781532 | 3.710686 | NaN | 3.569492 | 3.498646 |
5 rows × 554 columns
print(test["year"].unique())
[2010]
As we can easily observe, the test dataset contains all the features of the wind_ava dataset, including the dates; it only lacks the target column, energy, which we have to predict. The dataset contains observations from the year 2010 only. We now check whether there are any NAs in this dataset:
missing_values_count = test.isnull().sum()
# Filter the columns that have at least one null value
missing_values_columns = missing_values_count[missing_values_count > 0]
# Print the features with missing values and how many each has
print(missing_values_columns)
p54.162.1 99
p54.162.2 113
p54.162.3 71
p54.162.4 68
p54.162.5 138
...
v100.21 65
v100.22 107
v100.23 147
v100.24 152
v100.25 119
Length: 550, dtype: int64
As in the previous dataset, we find NA values in all of the attributes except the dates. Hence, we perform iterative imputation on this data:
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X_test = test.iloc[:, 4:]
imputer = IterativeImputer(random_state=100512068)
X_test = imputer.fit_transform(X_test)
np.savetxt('X_test.txt', X_test)
X_test = np.loadtxt('X_test.txt')
Using the whole $\texttt{wind_ava}$ dataset (with the iterative imputation we already computed) as X_train and y_train, we have trained all of the final models that feed the stacking final model. To this end, we ran the scripts $\texttt{finalmodelknn.py}$, $\texttt{finalmodeltrees.py}$, $\texttt{finalmodelrf.py}$, $\texttt{finalmodelgb.py}$ and $\texttt{finalmodelextra.py}$, which fit the chosen hyperparameter configuration for each method (the default settings for the Extra Trees, Random Forest and Gradient Boosting models, and the Random Search configuration for the KNN and Decision Trees models). The resulting final models have been saved in the files $\texttt{FinalModelKnn.pkl}$, $\texttt{FinalModelTrees.pkl}$, $\texttt{FinalModelGb.pkl}$, $\texttt{FinalModelRf.pkl}$ and $\texttt{FinalModelExtra.pkl}$, all inside the $\texttt{finalmodels}$ folder.
Using all of these models, we finally computed the stacking final model with the script $\texttt{finalmodelstacking.py}$, saving it in $\texttt{FinalModel.pkl}$. The stacking final model took 1095 s to train, in addition to the computation time of all of the base models, which adds up to about 500 s. While computing the final model we also computed the predictions for the $\texttt{wind_competition}$ dataset, using the imputed X_test matrix, and saved them in the file $\texttt{FinalPredictions.txt}$.
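The persist-and-predict pattern used by these scripts can be sketched as follows (the file names and the toy model here are illustrative, not the actual artifacts of the assignment):

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Train a toy model, persist it, reload it, and save predictions,
# mirroring the workflow described above
rng = np.random.RandomState(0)
X, y = rng.randn(50, 3), rng.randn(50)
model = LinearRegression().fit(X, y)

joblib.dump(model, "toy_model.pkl")       # persist the fitted model
reloaded = joblib.load("toy_model.pkl")   # restore it later
np.savetxt("toy_predictions.txt", reloaded.predict(X))
```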
We will now perform a small analysis on the predictions.
finalpreds=np.loadtxt('finalmodels/FinalPredictions.txt')
print('The mean of the final predictions is',np.round(np.mean(finalpreds),2),'.')
The mean of the final predictions is 668.02 .
This mean is not too far from the mean of our training data, which was 693.1, which suggests the predictions are consistent with the training data, although somewhat lower.
We will now plot the energy predictions against the observation index to observe the evolution of the energy during the year 2010. To make the graph easier to read, we label the x-axis with the month corresponding to each instance, but note that we are not plotting energy vs. month.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({
'FinalPreds': finalpreds,
'Month': test['month']
})
plt.figure(figsize=(6, 4))
plt.plot(df.index, df['FinalPreds'], marker=None, linewidth=1)
xticks_positions = [i for i in range(0, len(df['Month']), len(df['Month']) // 5)]
xticks_labels = [df['Month'][i] for i in xticks_positions]
plt.xticks(xticks_positions, xticks_labels)
plt.xlabel('Instance index (Month)')
plt.ylabel('Energy')
plt.title('Energy vs. Instance index, during 2010')
plt.grid(True)
plt.show()
The graph shows exactly the monthly evolution of wind energy that we would expect: more wind during autumn and spring, somewhat less during winter, and much less during summer.